Abstract

De-duplication, the identification of distinct records referring
to the same real-world entity, is a well-known challenge in data
integration. Since very large datasets prohibit the comparison of
every pair of records, blocking has been identified as a technique
of dividing the dataset for pairwise comparisons, thereby trading
off recall of identified duplicates for efficiency. Traditional
de-duplication tasks, while challenging, typically involved a fixed
schema such as Census data or medical records. However, with the
presence of large, diverse sets of structured data on the web and
the need to organize it effectively on content portals,
de-duplication systems need to scale in a new dimension to handle a
large number of schemas, tasks and data sets, while handling ever
larger problem sizes. In addition, when working in a map-reduce
framework it is important that canopy formation be implemented as a
hash function, making the canopy design problem more challenging.
We present CBLOCK, a system that addresses these challenges.

CBLOCK learns hash functions automatically from attribute domains
and a labeled dataset consisting of duplicates. Subsequently,
CBLOCK expresses blocking functions using a hierarchical tree
structure composed of atomic hash functions. The application may
guide the automated blocking process based on architectural
constraints, such as by specifying a maximum size of each block
(based on memory requirements), imposing disjointness of blocks (in
a grid environment), or specifying a particular objective function
trading off recall for efficiency. As a post-processing step to
automatically generated blocks, CBLOCK rolls up smaller blocks to
increase recall. We present experimental results on two large-scale
de-duplication datasets at Yahoo!, consisting of over 140K movies
and 40K restaurants respectively, and demonstrate the utility of
CBLOCK.

1 Introduction 

Integrating data from multiple sources containing over- 
lapping information invariably leads to duplicates in the 
data, arising due to different sources representing the 
same entities (or facts) in slightly different ways; e.g., 
one source says "George Timothy Clooney" and another 
says "G. Clooney". The problem of identifying different 
records referring to the same real-world entities is known
as de-duplication.^1 De-duplication has been identified as
an important problem in data integration, and has enjoyed 
significant research interest, e.g. ifTTl [T2l [T3l l24l l271 l29l 

El. 

Conceptually, de-duplication may be performed by 
considering each pair of records, and applying some 
matching function [12, 21, 31] to compute a similarity
score, then determining duplicate sets of records based 
on clustering similar pairs. However, comparing all pairs 
of records to be de-duplicated is prohibitively expensive 
in commercial or web applications that require match- 
ing data sets with millions of records (e.g., persons, 
business listings, etc). Blocking or canopy-formation 
(e.g., la [71 [Bl [nl [El 122 [231 El) has been identified 
as a standard technique for scaling de-duplication: The 
basic idea is to find a set of (possibly overlapping) sub- 
sets of the entire dataset (called blocks), and then com- 
pute similarity scores only for pairs of entities appearing 

^1 De-duplication is also known by many other names such as
reference reconciliation, record linkage, and entity resolution.

in the same block. We use the term "blocking function"
to refer to any function that maps entities to block num- 
bers, usually based on the value of one or more attributes. 
One example of a blocking function would be the value 
of the "phone number" attribute, or the first seven digits 
of the same, etc. In an ideal situation, all (or most) of the 
duplicates would appear together in at least one block. 
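
As a rough illustration of the idea above, the following sketch groups a handful of made-up records by a blocking function (the first seven digits of a phone number) and compares only records that share a block. The record fields, the sample values, and the helper names `block_by` and `candidate_pairs` are assumptions made for this example and do not come from CBLOCK.

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key_fn):
    """Group records into blocks keyed by the blocking function's value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return dict(blocks)

def candidate_pairs(blocks):
    """Yield only the pairs of records that share a block."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "George Timothy Clooney", "phone": "5551234000"},
    {"id": 2, "name": "G. Clooney", "phone": "5551234999"},
    {"id": 3, "name": "Brad Pitt", "phone": "5559876000"},
    {"id": 4, "name": "B. Pitt", "phone": "5559876111"},
]

# Blocking function: first seven digits of the phone number.
blocks = block_by(records, lambda r: r["phone"][:7])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(records) * (len(records) - 1) // 2

print(pairs)      # [(1, 2), (3, 4)]
print(all_pairs)  # 6 pairs would be compared without blocking
```

With this blocking function only two of the six possible pairs are scored, and both likely duplicates still land in a common block.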

As a result, a good blocking function must be de- 
signed for each large-scale matching task. We are seek- 
ing to build a scalable system for de-duplication of web 
data. The system will be used for a wide variety of de- 
duplication tasks, and must support agility, the ability 
to rapidly develop new de-duplication applications. Ac- 
cordingly, an important part of developing this system is 
effective, automatic construction of blocking functions. 
Like [30], de-duplication tasks in our system execute in a
map-reduce framework like Hadoop. In this setting, com- 
putation is broken into rounds consisting of a map phase 
in which a set of keys is generated by which work is split 
over a potentially large number of compute nodes and a 
reduce phase in which partial results from each compute 
node are combined. A natural approach for de-duplication 
is to use the map phase to execute the blocking function, 
allowing match scores to be computed in parallel on each 
mapper. 

In order to design appropriate blocking functions for 
our setting, we face four important challenges. First, a 
premium is placed on minimizing the number of rounds 
of computation in a map-reduce setting, since each round 
involves significant scheduling and co-ordination over- 
heads. Second, data in our system comes from a vari- 
ety of feeds, and is often noisy. In particular, attributes 
may be only partially populated, leading to asymmetric 
block sizes if these attributes are used for blocking. Third, 
matching is executed in parallel, meaning that a premium 
is placed on minimizing the size of the largest block with- 
out exceeding the maximum number of compute nodes 
available. Fourth, the complexity of the de-duplication 
process can be significantly reduced if every object is 
given only a single hash value for mapping, which we
refer to as the disjoint blocking condition. 

We present CBLOCK, a system that automatically creates canopies
based on the information specified by the application. We now
describe the approach taken in CBLOCK to address the above
challenges. First, we introduce a conditional tree of blocking
functions, the BlkTree. In this tree, blocks with large expected
size are explicitly mapped to a child blocking function, making
each path in the tree equivalent to a conjunctive blocking function
applied to a subset of the data. The BlkTree thus yields an
expressive blocking function, which allows us to effectively block
even skewed data, such as attributes with many null values.

Figure 1: Components of the CBLOCK system

Second, to handle the situation in which the number of 
blocks exceeds the number of compute nodes, we intro- 
duce a roll-up step for the BlkTree to efficiently reduce the 
number of compute nodes without excessively increasing 
complexity of the hash function. Third, we optimize for 
the best blocking function while keeping the size of the 
largest block within a constrained size. Since the over- 
all latency of the parallel computation corresponds to the 
slowest node, this is a natural optimization goal, but is 
not addressed by existing techniques. As an aside, we 
note that our system can also be used for other applica- 
tions that require a similar capability as blocking: (a) In 
a binary classifier with many features, CBLOCK may be 
used to pick a small set of features that most effectively 
captures the classification; (b) We can use CBLOCK to 
determine which sets of values from two relations may 
contribute most to join results in a distributed join
solution such as [25].

The flow of data through CBLOCK is illustrated in Figure 1.
The input to the system, shown on the left, is a set of training
examples consisting of true-positive match pairs shown at the top,
and a set of configuration parameters shown at the bottom including
size constraints, disjointness conditions and any tuning of the
cost objective.
These inputs feed into the CBLOCK system, shown in the 
middle block, that designs a blocking configuration. This 
configuration is then passed to the runtime system (e.g. a 
map-reduce system) for execution of the blocking as the 
first phase of the de-duplication algorithm.

1.1 Contributions

Our paper makes the following contributions, addressing
the requirements of automatic blocking configuration for
web-scale de-duplication:

• In order to decrease the number of rounds (disjuncts),
we argue that it is necessary to increase the power
of individual hash functions while still respecting
disjointness constraints. In Section 4 we formally
introduce blocking trees to accomplish this goal. We show
that in general finding blocking trees that maximize
recall subject to a maximum block-size constraint is
NP-hard, and we provide a natural greedy algorithm.

• We show that adapting the state-of-the-art solution
of [7, 23] to optimize for the maximum size constraint
can naturally be expressed as a special case of finding
optimal blocking trees.

• Section 5 introduces the roll-up problem of merging
small canopies produced using any disjoint blocking
scheme. We show a close connection of the roll-up
problem with the knapsack problem, establish the
NP-completeness of solving the general problem, and
provide a heuristic algorithm based on a 2-approximate
algorithm for the knapsack problem.

• Section 6 studies the "drill-down" problem, i.e., given a
domain of an attribute and a labeled dataset of true
duplicates we want to find optimal hash functions that
meet a canopy-size requirement. We formally define
the problem and present a near-linear time optimal
algorithm based on dynamic programming.

• For most of the paper we focus our attention on
disjoint blocking functions. Section 7 extends our study
to non-disjoint blocking functions. To our knowledge,
this is the first work to consider the disjointness issue
for blocking design.

• CBLOCK is fully implemented along with all the
functionality described above. In Section 8 we
describe our system and present experimental results on
two large commercial datasets consisting of around
140K movies and 40K restaurants respectively.

Related work is described in Section 2. Due to space
constraints, formal proofs for all technical results are omitted
from the paper.

2 Related Work

To the best of our knowledge, ours is the first work to: 
(1) Present techniques on finding blocking functions by 
explicitly trading-off recall for efficiency, and in a more 
expressive tree-based structure than flat conjunctive struc- 
tures of past work; (2) Formally introduce and study the 
problem of rollup as an important post-processing step to 
assemble small canopies and increase recall; (3) Provide 
automatic solutions to the drill-down problem as a way 
of bootstrapping blocking with no manual effort, or aug- 
menting manually-generated hash functions; (4) Present 
an automatic blocking system for de-duplication in a dis- 
tributed setting that is applied to two large commercial 
datasets from a search engine. Very few pieces of previous 
work consider blocking based on labeled training data, 
while there is a much larger body of work on hand-tuned 
blocking techniques using similarity functions. We start
by describing the relationship of our work with blocking
based on labeled data (Section 2.1), followed by blocking
without labeled data (Section 2.2), and finally other work
on de-duplication (Section 2.3).



2.1 Blocking With Labeled Data 

Two recent papers [7, 23] presented approaches to constructing
a blocking function using a labeled dataset of positive and
negative examples. Roughly speaking, both papers learn conjunctive
rules (and disjunctions of conjunctive rules) to maximize recall.
[7] attempts to maximize the number of positive minus negative
examples covered, effectively using negative examples as a proxy
for minimizing the size. [23] uses only positive examples, but does
not explicitly incorporate any size restriction. Below we give a
detailed comparison with these past approaches:

• We present BlkTrees, a more expressive language for 
expressing disjoint blocking functions than previous 
work. Given only simple or conjunctive blocking 
functions, it may not be possible to construct an ef- 
fective blocking function without a large number of 
map-reduce rounds (disjuncts). 

• Minimizing negative training examples covered by a 
blocking solution may lead to quality problems from 
overly aggressive blocking. For example, consider a 
movie and a remake with the same title but released in 
a different year: while the two are a negative example,
this does not mean it is a bad idea to block movies
together when their titles are similar. In short, blocking
should be optimized for recall vs. efficiency, and
match rules optimized for precision.

• Minimizing negative training examples does not 
match the cost model of parallel computation mod- 
els like map-reduce, where latency is determined by 
the largest block. 

• We are the first to introduce and solve the rollup and
drill-down problems. These problems were not addressed
in [7, 23], or in other past work on blocking.

2.2 Blocking Without Labeled Data 

[15] introduced the notion of blocking (called "merge-purge")
by constructing a key for each record, sorting based on the key,
and then performing matching and merging in a sliding window.
[15] (and other variants [17, 19]) do not consider automatic
generation of optimal blocking functions in a distributed
environment, based on training data.

SWOOSH [5] is a recently developed generic entity
resolution system from Stanford. Their specific paper on
blocking [32] focuses on "inter-block communication",
by propagating matched records to other blocks. Once
again, automatic generation of blocking functions is not
the subject of [32]. Further, D-Swoosh [6] (and other
similar work [25, 28]), their distributed framework for
entity resolution, focuses on distributing pairwise comparisons
across multiple processors, as opposed to our focus of
partitioning the data to reduce the number of total pairwise
comparisons.

Reference [22] presented techniques for generating
non-disjoint canopies based on distance measures such as
Jaccard similarity of tokens. After choosing a distance
function, they pick records as canopy centers, and add
to each canopy all records that are within some distance
based on the distance measure. The algorithms from [22]
cannot be directly scaled to a distributed environment. A
similar approach of generating (non-disjoint) canopies by
clustering based on any distance measure was also
proposed in [26]. Some other work [9] considers blocking
based on bi-grams of string attributes, followed by
creation of inverted lists for each bi-gram. Another recent
piece of work [18] considered transforming the data into
a Euclidean space. While the above approaches weren't
designed specifically for a distributed environment,
recently [30] studied the problem of performing approximate
set similarity joins using a map-reduce framework.
Their work can be used for blocking when records are
compared for duplicates based on set similarity functions.
Also, a recent system, MAHOUT [3], described an
implementation of canopy clustering in a map-reduce
framework. Finally, [4] performed a comparative study of
blocking strategies from [9, 15, 17, 22].

In general, the approaches described above rely on the 
knowledge of specific similarity/distance functions. Fur- 
thermore, they necessarily generate non-disjoint canopies, 
whereas one primary goal of our work was to consider 
disjoint canopies as an important choice for distributed 
de-duplication and obtain non-disjointness as multiple 
rounds of disjoint sets of canopies. Finally, none of this 
past work considers the rollup and drill-down problems. 

2.3 Other Work 

De-duplication has been studied for over 50 years now,
starting with the seminal pieces of work in [12, 24].
De-duplication of very large datasets broadly proceeds by
performing blocking, followed by pair-wise (or cluster-wide)
similarity computation within each block. A large
body of work has focused on the latter step of pair-wise
similarity computation, known as matching [12, 21, 31].
Some other work [8, 14, 20] has considered fuzzy matching
in the context of databases; however, none of this work
considers the problem of automatic blocking, drill-down,
or rollup. Finally, we note that the structure of BlkTrees is
akin to that of decision trees, a popular approach to
classification; however, the objectives of our BlkTrees
are completely different, namely effectively trading
off recall for efficiency in de-duplication.

3 Preliminaries 

3.1 Background and Notation 

We use $U$ to denote the set of entities (i.e., records) to
be de-duplicated. Dividing $U$ for pairwise comparisons
is known as blocking (or canopy formation). The divided
pieces are called blocks (or canopies). We use $\mathcal{C}$ to
denote the set of canopies, and $C_i$ to denote the individual
canopies. Formally, given a universe $U$, a set of canopies
is given by a finite collection $\mathcal{C} = \{C_1, \ldots, C_k\}$,
with $C_i \subseteq U$ and $\bigcup_i C_i = U$. A specific method to
construct $\mathcal{C}$ from $U$ is called a blocking function. We
start by restricting our attention to blocking functions that
create a disjoint set of canopies (i.e., if $i \neq j$, then
$C_i \cap C_j = \emptyset$) and then extend our results for
non-disjoint sets of canopies (Section 7). Intuitively, a good
blocking function must satisfy two desirable properties. First,
canopy formation increases the efficiency of de-duplication by
eliminating the need for performing pairwise comparisons
between all pairs of entities in $U$. Second, the quality of
de-duplication (i.e., recall of identified duplicates) must
not be significantly reduced by performing fewer comparisons.
Therefore, our goal is to find a set of canopies such
that most duplicates in $U$ fall within some canopy. We
shall use $T^+ \subseteq U \times U$ to denote a training dataset
consisting of labeled duplicates in $U$, over which recall of
blocking functions is measured. We shall construct blocking
functions using a space $\mathcal{H}$ of hash functions that partition
$U$ based on attributes of the entities in $U$; each hash
function assigns one hash value to each entity. For example,
one hash function partitions $U$ based on the first character
of the titles of movies. A conjunction of hash functions
$h_1 \wedge \ldots \wedge h_l$ is equivalent to creating a single hash
value by concatenating the hash values obtained by each $h_i$,
effectively creating partitions (equivalence classes) where the
values of each of the hash functions match. Typically $\mathcal{H}$ is
generated manually based on domain knowledge, and we
shall present techniques to construct blocking functions
using any $\mathcal{H}$. In addition, we shall also present techniques
to automatically identify optimal hash functions for each
attribute (Section 6).
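
The conjunction of hash functions described above can be sketched as follows. This is a minimal illustration, assuming entities are plain dictionaries; the two atomic hash functions (`first_char`, `decade`), the sample movies, and the helper names are hypothetical and chosen only for this example.

```python
def conjunction(*hash_fns):
    """Combine h_1, ..., h_l into one hash whose value is the tuple of their values."""
    def combined(entity):
        return tuple(h(entity) for h in hash_fns)
    return combined

def partition(entities, hash_fn):
    """Split entities into equivalence classes of the given hash function."""
    parts = {}
    for e in entities:
        parts.setdefault(hash_fn(e), []).append(e)
    return parts

# Hypothetical atomic hash functions over movie records.
first_char = lambda m: m["title"][:1].upper()
decade = lambda m: m["year"] // 10 * 10

movies = [
    {"title": "Titanic", "year": 1997},
    {"title": "Titanic", "year": 1953},
    {"title": "Toy Story", "year": 1995},
    {"title": "Alien", "year": 1979},
]

by_title = partition(movies, first_char)
by_both = partition(movies, conjunction(first_char, decade))

print(sorted(by_title))                   # ['A', 'T']
print(sorted(by_both, key=str))           # [('A', 1970), ('T', 1950), ('T', 1990)]
```

Each conjunctive class is a subset of a class of either individual hash function, so adding conjuncts can only make canopies smaller.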

3.2 Cost Model 

While CBLOCK can be configured with any cost model
for optimizing canopy formation, we use latency as the
default cost model in our discussion.^2 The latency of any
canopy formation is given by the total time it takes to
perform all pairwise comparisons in each canopy.

In a grid environment (such as our de-duplication system
implemented using map-reduce), pairwise comparisons
in the set of canopies are performed in parallel.
Given a canopy formation $\mathcal{C} = \{C_1, \ldots, C_k\}$, with the
number of entities in canopy $C_i$ denoted by $s_i$, the total
number of pairwise comparisons being performed is
$\sum_{i=1}^{k} s_i^2$. Motivated by de-duplication in a grid
environment, we use the cost model $cost(\mathcal{C}) = \max_i s_i^2$.
Clearly, in a truly elastic grid with a potentially infinite
supply of machines, pairwise comparisons for each canopy are
performed on a separate machine. Therefore, the latency is given
by the largest canopy, justifying our cost model of using
$\max_i s_i$.
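
The difference between total work and latency can be made concrete with a small calculation. The sketch below assumes that a canopy of $s_i$ entities costs $s_i^2$ comparisons, as in the cost model above; the canopy sizes are invented for the example.

```python
def total_comparisons(sizes):
    """Total pairwise-comparison work over all canopies (sum of s_i^2)."""
    return sum(s * s for s in sizes)

def latency_cost(sizes):
    """Latency on an elastic grid: the work of the largest canopy (max of s_i^2)."""
    return max(s * s for s in sizes)

# Hypothetical canopy sizes: one skewed canopy dominates the latency.
sizes = [5000, 1200, 300, 300]

print(total_comparisons(sizes))  # 26620000
print(latency_cost(sizes))       # 25000000
```

Here a single large canopy accounts for almost all of the latency, which is why the size of the largest canopy is the quantity to constrain.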

When the number of machines on the grid is limited
(and specifically, when there are fewer machines than
canopies), we are faced with the problem of assigning 
canopies to machines. The following theorem shows that 
this assignment is NP-hard in general, based on a direct 
reduction from a scheduling problem. However, we also 
show that the latency using the largest canopy gives an 
upper bound on the best possible assignment. 

Theorem 3.1 Given a set $M = \{M_1, \ldots, M_m\}$ of $m$
machines, a canopy formation $\mathcal{C} = \{C_1, \ldots, C_k\}$ over
$N$ entities, $m < k$, any assignment $A : \mathcal{C} \to M$ of
canopies to machines has a cost given by
$cost_A(\mathcal{C}) = \max_{i=1}^{m} \left( \sum_{C_j : A(C_j) = M_i} |C_j|^2 \right)$.
We have:

1. It is NP-hard to find an assignment that minimizes
$cost_A(\mathcal{C})$.

2. For all assignments A, we have max^^^ l^iP ^ 
costA(C) < (1 + ^) max^_;L ^^^^ specifically, 

let X = max(max,^_i ^Hl^)- ^^^^ 

X < costA{C) < 2X. 

Based on the theorem above, henceforth, we focus on the
problem of finding the best canopies that satisfy the constraint
$\max_i s_i \le S$, for some given $S$.
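
To illustrate the assignment problem with fewer machines than canopies, the following sketch uses a standard greedy list-scheduling heuristic (largest canopy first, always onto the least-loaded machine). This heuristic is a textbook technique and is not taken from the paper; the function name, canopy sizes, and machine count are assumptions for the example. The quantity `lower` is the usual scheduling lower bound: no assignment can beat either the largest single canopy or the average load per machine.

```python
import heapq

def assign_canopies(sizes, m):
    """Greedy largest-first assignment of canopies to m machines.

    Returns (assignment, loads), where loads[i] is the number of pairwise
    comparisons (sum of s^2) placed on machine i.
    """
    heap = [(0, i) for i in range(m)]  # (current load, machine index)
    heapq.heapify(heap)
    assignment = {}
    for idx in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        load, machine = heapq.heappop(heap)  # least-loaded machine
        assignment[idx] = machine
        heapq.heappush(heap, (load + sizes[idx] ** 2, machine))
    loads = [0] * m
    for load, machine in heap:
        loads[machine] = load
    return assignment, loads

# Hypothetical canopy sizes and machine count.
sizes = [40, 30, 30, 20, 10]
m = 2
assignment, loads = assign_canopies(sizes, m)
cost = max(loads)

work = [s * s for s in sizes]
lower = max(max(work), sum(work) / m)  # bound that holds for every assignment

print(loads)        # [2000, 1900]
print(lower, cost)  # 1950.0 2000
```

On this input the greedy assignment lands within a few percent of the lower bound, and list scheduling in general stays within a factor of two of it.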



4 Blocking Based on Labeled Data 

This section addresses the problem of constructing disjoint
blocking functions using a labeled dataset of positive
examples. After formally defining the problem (Section 4.1),
we introduce a tree-structured language for expressing blocking
functions (Section 4.2). We then show that the general problem of
finding an optimal blocking function is NP-hard (Section 4.3), and
finally we present a greedy heuristic algorithm (Section 4.4) to
find an approximate blocking function.

^2 All our algorithms and complexity results carry over for any
"monotonic cost function", i.e., $cost(\mathcal{C}) \le cost(\mathcal{C}')$
whenever $\forall C \in \mathcal{C}, \exists C' \in \mathcal{C}'$ such that
$C \subseteq C'$.

4.1 Problem Formulation

We formally define the problem of creating canopies 
given labeled data consisting of examples of duplicates 
(positive pairs). Recall the two conflicting goals of 
canopy formation: The more divisive a set of canopies 
is, the more likely it is to miss out on true duplicates. 
We formulate an optimization problem that trades off the 
two objectives of canopy formation, by associating a hard 
constraint on the maximum size of each canopy and max- 
imizing the number of covered positive examples (recall) 
subject to this size constraint. 

Definition 4.1 (Blocking Problem) Given a labeled set $T^+$
of positive examples, a space $\mathcal{H}$ of hash functions, a
size bound $S$ on every canopy, and a size function $size()$
that returns the size of a canopy obtained by applying any
conjunction of hash functions in $\mathcal{H}$ on any input dataset
$I$, construct a disjoint blocking function $B$ that partitions
any input $I$ into a set $\mathcal{C}$ of disjoint canopies of size
at most $S$, while maximizing the number of pairs from $T^+$
that lie within canopies, i.e., maximizing:

$$recall = \frac{|\{(r_1, r_2) \in T^+ \mid \exists C \in \mathcal{C},\ r_1, r_2 \in C\}|}{|T^+|}$$
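
The recall objective in the definition above can be computed directly for a disjoint set of canopies. The sketch below assumes records are represented by hashable identifiers; the canopies and labeled pairs are invented for the example.

```python
def recall(canopies, positive_pairs):
    """Fraction of labeled duplicate pairs whose two records share a canopy.

    Assumes the canopies are disjoint, so each record has one canopy index.
    """
    canopy_of = {}
    for cid, members in enumerate(canopies):
        for r in members:
            canopy_of[r] = cid
    covered = sum(
        1 for r1, r2 in positive_pairs
        if r1 in canopy_of and canopy_of[r1] == canopy_of.get(r2)
    )
    return covered / len(positive_pairs)

# Hypothetical canopies and labeled duplicate pairs.
canopies = [{"a", "b", "c"}, {"d", "e"}, {"f"}]
positive_pairs = [("a", "b"), ("c", "d"), ("d", "e"), ("e", "f")]

print(recall(canopies, positive_pairs))  # 0.5
```

Two of the four labeled pairs are split across canopies, so half of the duplicates could still be found by the matcher.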

We make a few important observations about our problem
definition. (1) As a reminder, we start by considering only
disjoint blocking, and extend to non-disjoint blocking in
Section 7. The next section describes a language to represent
disjoint blocking functions ($B$), and subsequently we
give algorithms for finding $B$. (2) We assume that there
is a known size estimation function. In practice, some
previous work on blocking [7] has used negative examples
as an indirect way of incorporating size restrictions.
Alternatively, previous work on estimating the cardinality
of selection queries using histograms (refer to [16]) can
be used to estimate canopy sizes since, as we shall see, each
canopy is obtained as a conjunction of hash functions. Of
course, if the entire dataset were available during the
construction of blocking predicates, it could be used for size
computation. (In particular, exact size computation for
the blocking technique we propose can be done in a few
scans. Also, we shall see that our technique can be adaptively
applied even in case of inaccurate size estimates.)
(3) For this section we assume the existence of a space
$\mathcal{H}$ of hash functions. Most previous work has assumed
the manual creation of such atomic hash functions. We
also present in Section 6 an automated method of
enumerating hash functions for each attribute. (4) Finally, we
assume the positive examples $T^+$ are known; we describe
the construction of this dataset in the experiments section
(Section 8).

4.2 Blocking-Tree Space 

This section presents a generic language for expressing 
disjoint blocking functions. We introduce a hierarchical 
blocking tree (called BlkTree), that partitions the entire set 
of entities in a hierarchical fashion by successively apply- 
ing atomic hash functions from a known class H. For- 
mally: 

Definition 4.2 A BlkTree $B = (N, E, h)$ is composed of
a tree with nodes $N$ and edges $E$, and $h : N \to \mathcal{H}$ maps
each node in the tree to a particular hash partitioning
function from $\mathcal{H}$.

Intuitively, each leaf node of the tree corresponds to a
canopy. The BlkTree is built using the inputs described
in Definition 4.1, namely the training data, a known space
of atomic hash functions $\mathcal{H}$, and canopy-size estimates.
Each node $n \in N$ in the tree corresponds to a set of
entities from the entire set obtained by applying the hash
functions from the root down to $n$. Each node $n$ (with a
size estimate exceeding the allowed maximum) then
applies a particular partitioning hash function to create
disjoint partitions of the set of entities corresponding to $n$.

At run-time, each entity is run through the BlkTree, 
and directed to the machine in the cluster based on the 
leaf node. (Note that in a distributed environment, the en- 
tire data itself is initially partitioned across multiple ma- 
chines; therefore, the BlkTree is stored on every machine 
in order to redistribute the data based on the canopies.) 
Note that in practice the total number of large canopies 
created by any hash function on any node is a constant, 
for instance due to NULL values in the data, or a com- 
mon default value for an attribute. Therefore, the size of 
the constructed BlkTree in terms of the number of nodes 
is small, so that the BlkTree fits in memory, and applying 
the BlkTree to an entity is efficient. 

Example 4.3 Figure 2 shows an example BlkTree for
movie data with the root partitioning the movies
lexicographically based on the title. This partition results in
two large canopies: the node corresponding to NULL titles,
and the node corresponding to titles that start with
"T" (assume all titles have been capitalized in advance).
In the NULL canopy a partition based on the release-year
of the movie is performed, while the movies starting with
"T" are partitioned by the name of the movie's director.
All leaf nodes in the resulting tree satisfy the maximum
canopy-size requirement, and hence no further partitioning
is performed.

Figure 2: Example of a tree-structured disjoint blocking function.
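
One possible in-memory representation of such a tree, and of routing an entity to its canopy at run-time, is sketched below. The class `BlkNode`, the function `route`, and the three atomic hash functions are hypothetical names chosen for this illustration; the tree mirrors the movie example, where only the NULL-title and "T" canopies are split further.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Optional

@dataclass
class BlkNode:
    # Hash function applied at this node; children exist only for hash
    # values whose canopies were too large and had to be split further.
    hash_fn: Optional[Callable[[dict], Hashable]] = None
    children: Dict[Hashable, "BlkNode"] = field(default_factory=dict)

def route(node, entity):
    """Return the canopy key: the tuple of hash values from root to leaf."""
    path = ()
    while node is not None and node.hash_fn is not None:
        value = node.hash_fn(entity)
        path += (value,)
        node = node.children.get(value)  # no child: this canopy is a leaf
    return path

# Hypothetical atomic hash functions for movie records.
title_initial = lambda m: m["title"][0].upper() if m.get("title") else None
release_year = lambda m: m.get("year")
director = lambda m: m.get("director")

tree = BlkNode(title_initial, {
    None: BlkNode(release_year),  # NULL titles are split by release year
    "T": BlkNode(director),       # titles starting with "T" are split by director
})

print(route(tree, {"title": "Titanic", "director": "James Cameron", "year": 1997}))
# ('T', 'James Cameron')
print(route(tree, {"title": None, "year": 1950}))   # (None, 1950)
print(route(tree, {"title": "Alien", "year": 1979}))  # ('A',)
```

Because every entity follows exactly one root-to-leaf path, the resulting canopies are disjoint, and the returned tuple can serve directly as the key emitted by a map phase.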

4.2.1 Restricted languages 

We note that BlkTrees are a very expressive language to 
describe disjoint blocking functions. In particular, the fol- 
lowing natural languages are obtained as restrictions of 
BlkTrees: 

1. Single hashes: Clearly single hash functions are 
equivalent to a BlkTree of height 1. 

2. Conjunctive functions (chains): Conjunctions of
hash functions are equivalent to restricting the width
of BlkTrees to 1, i.e., a branching factor of 1. In
particular, a conjunction $h_1 \wedge \ldots \wedge h_k$ is equivalent
to applying each hash function $h_i$ in sequence to
every single canopy, irrespective of whether a canopy is
smaller than the required size $S$. Note that (disjunctions
of) chains are the basic construct used in [7, 23],
whereas our language of (disjunctions of) BlkTrees is
significantly more expressive.

3. Chain-tree: Chain-trees are an extension of conjunctions
where we are again given a chain $h_1, \ldots, h_k$
of hash functions to be applied in sequence; however,
subsequent hash functions are applied only if the
canopy size exceeds the allowed maximum. In particular,
chain-trees are obtained by restricting every level
of BlkTrees to have the same hash function.

In our experiments, we implement algorithms for Blk- 
Trees and all the restricted languages above and compare 
them in terms of recall, to observe a significantly higher 
recall using BlkTrees. 



Input: Node $n$ consisting of entities $C_n$, duplicates $T^+$,
       space $\mathcal{H}$ of hash functions, size bound $S$.
if $|C_n| > S$ then
    least = $\infty$; best = NULL
    for $h \in \mathcal{H}$ do
        Compute e = elim-count($C_n$, $T^+$, $S$, $h$)
        if least > e then
            least = e; best = $h$
        end if
    end for
    Set best as the hash for node $n$.
    Recurse on nodes resulting from best applied to $n$.
end if

Algorithm 1: Recursive greedy construction of BlkTree.
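
A minimal runnable sketch of the recursive greedy construction follows. It is an illustration under several assumptions rather than the CBLOCK implementation: entities are tuples, sizes are computed exactly instead of estimated, `elim_count` simply counts the positive pairs that a hash function separates, and a hash function is removed from the candidate set once used on a path (a guard added here so that the recursion always terminates). All names and the sample data are hypothetical.

```python
import math

def partition(entities, h):
    """Split entities into groups by hash value."""
    parts = {}
    for e in entities:
        parts.setdefault(h(e), []).append(e)
    return parts

def elim_count(entities, pairs, S, h):
    """Number of positive pairs whose two records get different hash values."""
    return sum(1 for a, b in pairs if h(a) != h(b))

def build_blktree(entities, pairs, hash_fns, S):
    """Return {'leaf': entities} or {'hash': name, 'children': {value: subtree}}."""
    if len(entities) <= S or not hash_fns:
        return {"leaf": entities}
    least, best = math.inf, None
    for name, h in hash_fns.items():
        e = elim_count(entities, pairs, S, h)
        if least > e:
            least, best = e, name
    h = hash_fns[best]
    rest = {k: v for k, v in hash_fns.items() if k != best}  # termination guard
    children = {}
    for value, part in partition(entities, h).items():
        kept = [(a, b) for a, b in pairs if h(a) == value and h(b) == value]
        children[value] = build_blktree(part, kept, rest, S)
    return {"hash": best, "children": children}

def leaves(tree):
    """Yield the canopies (leaf entity lists) of a tree."""
    if "leaf" in tree:
        yield tree["leaf"]
    else:
        for child in tree["children"].values():
            yield from leaves(child)

# Hypothetical movie records: (id, title, year).
movies = [(1, "Titanic", 1997), (2, "Titanic (1997)", 1997),
          (3, "Toy Story", 1995), (4, "Toy Story 1", 1995),
          (5, "Alien", 1979), (6, "Aliens", 1986)]
pairs = [(movies[0], movies[1]), (movies[2], movies[3])]
hash_fns = {"first_letter": lambda m: m[1][:1], "year": lambda m: m[2]}

tree = build_blktree(movies, pairs, hash_fns, S=2)
canopies = list(leaves(tree))
print(tree["hash"])                      # first_letter
print([[m[0] for m in c] for c in canopies])  # [[1, 2], [3, 4], [5, 6]]
```

In this toy run only the oversized "T" canopy is split a second time, while the "A" canopy is left alone, which is exactly the behavior a flat conjunction cannot express.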

4.3 Intractability 

Next we demonstrate that the general problem of finding 
the optimal BlkTree is NP-hard, and subsequently present 
a heuristic greedy algorithm. 

Lemma 4.4 (BlkTree intractability) Given a training
set $T^+$ with positive examples, a space $\mathcal{H}$ of hash
functions, and a bound $S$ on the maximum size of any canopy,
assuming $P \neq NP$, there does not exist any polynomial-time
(in $|T^+|$, $|\mathcal{H}|$) algorithm to find the optimal BlkTree.

4.4 Greedy Algorithm 

We propose a simple heuristic for constructing the BlkTree,
described in Algorithm 1. The general scheme of the
algorithm is to locally pick the best hash function at every
node in the tree, if the size (estimate) of the number of
entities in this node is over the allowed maximum $S$. (If a
particular hash function generates many large canopies, it
is ignored, in order to maintain a small BlkTree. However,
as described before, the number of large canopies is
typically small; in our experiments over 140K movie entities,
no hash function created more than a few large canopies.)
The best hash function for a node is picked greedily by
counting, for all hash functions $h \in \mathcal{H}$, the number of
duplicates that get eliminated on choosing the hash
function $h$. The hash function that minimizes the number of
eliminated duplicates is chosen. We describe three ways
of counting the number of examples eliminated (function
elim-count in Algorithm 1). Suppose a node $n$ has $P$
positive pairs, and application of $h$ eliminates $P_h$
duplicates and creates canopies $C_1, \ldots, C_k$ exceeding size $S$
(among other canopies that are smaller than $S$). If the
number of positive pairs in $C_i$ is denoted $P(C_i)$, then the
three ways of counting the drop in the number of positive
pairs are as follows:

• Optimistic Count: Intuitively, Algorithm 1 picks
the hash function by assuming that no more duplicate
examples would get eliminated, hence it is optimistic:

$$\text{Optimistic} = P_h$$

• Pessimistic Count: On application of a hash function
$h$, we say that the number of duplicates that are
eliminated includes the ones broken by $h$ as well as
all examples that still remain in canopies larger than
$S$:

$$\text{Pessimistic} = P_h + \sum_{i=1}^{k} P(C_i)$$

• Expected Count: For the duplicates that still remain
in large canopies after applying $h$, we compute an
expected number of eliminated duplicates based on
a random split of the canopy so as to obtain canopies
of size $S$:

$$\text{Expected} = P_h + \sum_{i=1}^{k} \frac{w_i - 1}{w_i} P(C_i)$$

where $w_i = \lceil |C_i| / S \rceil$. Effectively, a random split
would only retain a $\frac{1}{w_i}$ fraction of the positive pairs,
assuming each pair is independent.
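
The three counts can be compared on a small numeric example. The sketch below assumes the reading used above, in which a canopy of size $|C_i|$ is randomly split into $\lceil |C_i| / S \rceil$ pieces and a pair survives with probability one over that number; the function name and the input numbers are invented for illustration.

```python
import math

def elimination_counts(P_h, large, S):
    """Return (optimistic, pessimistic, expected) eliminated-pair counts.

    P_h   : positive pairs directly separated by hash function h
    large : list of (canopy_size, positive_pairs_inside) for canopies > S
    S     : maximum allowed canopy size
    """
    optimistic = P_h
    pessimistic = P_h + sum(p for _, p in large)
    # Assumed split factor: ceil(size / S) random pieces per large canopy.
    expected = P_h + sum(p * (1 - 1 / math.ceil(size / S)) for size, p in large)
    return optimistic, pessimistic, expected

# Hypothetical node: h breaks 10 pairs and leaves two oversized canopies.
counts = elimination_counts(P_h=10, large=[(2000, 40), (4000, 40)], S=1000)
print(counts)  # (10, 90, 60.0)
```

The expected count always lies between the optimistic and pessimistic counts, and it penalizes a hash function more heavily the further its leftover canopies exceed the size bound.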

Finally, we note that an important feature of constructing the BlkTree
is that it can be naturally adapted at run-time based on the actual
canopy sizes, such as when the canopy-size estimates turn out to be
inaccurate, or when available memory has been reduced. Suppose a
canopy-size bound of S (say, 5000 entities) is imposed during
construction of the BlkTree; we may instead choose to construct the
BlkTree based on a maximum canopy size that is a fraction of S (say,
1000 entities). Effectively, we create a longer tree than necessary, and
this "extra" portion of the tree may be used if any canopy needs to be
split further for the reasons described above. Conversely, if the actual
canopy sizes turn out to be smaller than expected, we may choose to run
through only a smaller part of the tree.



5 Rolling up small canopies 

In this section, we introduce the problem of rolling up 
small canopies. The primary motivation for studying this 
problem is that a blocking function may unnecessarily 
have to create many small canopies, in order to make 
some of the larger canopies fit the required size bound. 
Therefore, as a post processing step, we can take the re- 
sult of any blocking function, and combine multiple small 
canopies maintaining the size requirement yet increasing 
the overall recall. 

We are given a set of canopies $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$,
where each canopy $C_i$ has size (much) less than our canopy size limit
$S$. We are also given a set of pairs of matching records
$T^+ = \{\ldots, (r_i^1, r_i^2), \ldots\}$. The rollup problem is to find
a set of canopies $\mathcal{D} = \{D_1, D_2, \ldots, D_l\}$ such that

• Disjointness Constraint: $\forall i, j,\ i \neq j:\ D_i \cap D_j = \emptyset$

• Roll Up Constraint: $\forall i,\ \exists i_1, i_2, \ldots$ such that $D_i = \bigcup_j C_{i_j}$

• Maximum Size Constraint: $\forall i,\ |D_i| \le S$

• Maximize Recall: minimize the number of pairs of matching examples
from $T^+$ that are split across canopies.
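
To make the constraints concrete, the following minimal Python sketch checks a proposed rollup and counts the matching pairs it splits. It assumes the input canopies are disjoint sets of record ids and that every record in a matching pair belongs to some canopy; the function name and the `groups` encoding (lists of input-canopy indices) are illustrative assumptions.

```python
def evaluate_rollup(canopies, groups, matching_pairs, max_size):
    """Validate a rollup and count the matching pairs it splits.

    canopies       -- list of disjoint sets of record ids (the input C)
    groups         -- list of lists of indices into `canopies`; each inner
                      list is one rolled-up canopy D_i
    matching_pairs -- iterable of (r1, r2) record-id pairs (T+)
    max_size       -- the canopy-size bound S
    """
    # Roll-up and disjointness: each input canopy is used exactly once.
    used = sorted(i for group in groups for i in group)
    if used != list(range(len(canopies))):
        raise ValueError("each input canopy must appear in exactly one group")
    merged = [set().union(*(canopies[i] for i in group)) for group in groups]
    # Maximum size constraint.
    if any(len(d) > max_size for d in merged):
        raise ValueError("a rolled-up canopy exceeds the size bound")
    # Objective: number of matching pairs split across rolled-up canopies.
    owner = {record: j for j, d in enumerate(merged) for record in d}
    split = sum(1 for r1, r2 in matching_pairs if owner[r1] != owner[r2])
    return merged, split
```

A smaller `split` value corresponds to higher recall on the labeled pairs.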

Note that the rollup problem can be applied on any set of canopies
generated using any previous blocking function. In particular, it can be
applied on the BlkTree blocking function generated in Section 4. Each
leaf of the BlkTree corresponds to a canopy, and by applying rollup,
some leaves of the BlkTree get merged so as to maintain the size
requirement but increase recall. Figure 3 shows an example blocking
function obtained by performing rollup on the BlkTree in Figure 2. Note
that although the resulting blocking function isn't a tree, the
resulting DAG can still be used for distributed canopy formation: each
entity starts at the root and traverses all the way down through the
directed edges to a (possibly rolled-up) leaf node, which corresponds to
a canopy.

We start by showing that the roll-up problem is in- 
tractable: 

Lemma 5.1 (NP-completeness) The rollup problem de- 
scribed above is NP-complete. 

Figure 3: Rollup applied on canopies (leaf-nodes) generated in Fig. 2.

Next we propose a greedy heuristic for the rollup problem that is
inspired by Dantzig's 2-approximation algorithm [10] for the knapsack
problem. Conceptually, our algorithm (Algorithm 2) starts with the
initial set of canopies, and progresses in steps. In each step, the
algorithm finds the pair of sets $D_1, D_2$ that together have at most
$S$ records, and maximize the following quantity:

$\mathrm{benefit}(D_1, D_2) / \min(|D_1|, |D_2|)$    (1)

Here $\mathrm{benefit}(D_1, D_2)$ is the number of matching pairs
$(r_1, r_2) \in T^+$ such that $r_1 \in D_1$ and $r_2 \in D_2$.
Intuitively, in each step we pick the canopy that has the smallest size
but also puts a large number of matching pairs in the same canopy.

Algorithm 2 can be efficiently implemented in time linear in the number
of matching pairs ($|T^+|$) and quadratic in the number of input
canopies ($|\mathcal{C}|$). Initially, we compute for each canopy
$D \in \mathcal{C}$ one merge candidate. This is a canopy $D'$ with
$|D'| \ge |D|$ and $|D| + |D'| \le S$ such that
$\mathrm{benefit}(D, D')$ is maximum. This step takes
$O(|T^+| \cdot |\mathcal{C}|^2)$ time. In each step, we find the canopy
$D$ whose merge candidate has the maximum benefit-to-size ratio; we then
merge $D$ with its merge candidate. The new merge candidate for a canopy
other than $D$ and $D'$ is either $D \cup D'$ or its old merge
candidate; this step takes $O(1)$ time for each canopy. The new merge
candidate for $D \cup D'$ can be computed in
$O(|T^+| \cdot |\mathcal{C}|)$ time by considering all the other
canopies and the positive examples. Since the algorithm terminates in at
most $|\mathcal{C}|$ steps, our algorithm has
$O(|T^+| \cdot |\mathcal{C}|^2)$ time complexity.
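
The greedy rollup of Algorithm 2 can be sketched in Python as follows. This is a deliberately naive version that recomputes all benefits in every step; it does not implement the merge-candidate bookkeeping described above and therefore does not achieve the stated running time. It assumes non-empty, disjoint canopies given as sets of record ids, and the function name is an illustrative assumption.

```python
from itertools import combinations


def greedy_rollup(canopies, matching_pairs, max_size):
    """Greedily merge canopies using the benefit / min-size ratio (Eq. 1)."""
    current = [set(c) for c in canopies]
    while True:
        owner = {r: i for i, c in enumerate(current) for r in c}
        # benefit[(i, j)]: matching pairs currently split between i and j.
        benefit = {}
        for r1, r2 in matching_pairs:
            i, j = owner[r1], owner[r2]
            if i != j:
                key = (min(i, j), max(i, j))
                benefit[key] = benefit.get(key, 0) + 1
        best, best_score = None, -1.0
        for i, j in combinations(range(len(current)), 2):
            # Only pairs that respect the size bound are candidates.
            if len(current[i]) + len(current[j]) <= max_size:
                smaller = min(len(current[i]), len(current[j]))
                score = benefit.get((i, j), 0) / smaller
                if score > best_score:
                    best, best_score = (i, j), score
        if best is None:          # no mergeable pair is left
            return current
        i, j = best
        merged = current[i] | current[j]
        current = [c for k, c in enumerate(current) if k not in (i, j)]
        current.append(merged)
```

As in the listing of Algorithm 2, merging continues until no pair of canopies fits within the size bound, even if the remaining candidate pairs have zero benefit.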

6 Drill-Down Problem 

In Section 4 we assumed a pre-existing and manually generated space of
hash functions (as is done in most previous work). Next we propose
automatic techniques for generating hash functions, using only an
attribute's domain and a labeled dataset. Automatically constructed hash
functions may be used to bootstrap the blocking methods, eliminating the
need for a significant upfront manual effort. Moreover, even in the
presence of an existing space of manually constructed hash functions, we
can augment the space with (better) automatically generated hash
functions.

We introduce the "drill-down" problem for a single attribute. Our goal
is to optimally divide a single attribute's domain into disjoint sets so
as to cover as many duplicate pairs as possible, while ensuring that the
cost associated with any set is below a required threshold. First we
formally define the partitioning of an attribute's domain into disjoint,
covering, contiguous subsets (called a DCC partition), then define the
problem of finding an optimal DCC partition.

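Definition 6.1 (DCC Partition) Given a domain $D$ with total ordering
$\prec$, least element 'start' and greatest element 'end', we say that a
set $\mathcal{I}$ is a DCC partition of $D$ if
$\forall I \in \mathcal{I}: I \subseteq D$ and all of the following hold:

• Disjoint: $\forall I_1, I_2 \in \mathcal{I},\ I_1 \neq I_2 \Rightarrow I_1 \cap I_2 = \emptyset$

• Contiguous subset: Every $I \in \mathcal{I}$ is of the form
$[\alpha, \beta]$, $[\alpha, \beta)$, $(\alpha, \beta]$, or
$(\alpha, \beta)$, with $\alpha \preceq \beta$ and
$\alpha, \beta \in D \cup \{start, end\}$

• Covering: $D = \bigcup_{I \in \mathcal{I}} I$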
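
As a small illustration of the three conditions, the sketch below checks them for the special case of an integer domain (for example, release years) tiled by closed intervals. The restriction to integers and closed intervals, and the function name, are assumptions made for the example only.

```python
def is_dcc_partition(intervals, start, end):
    """Check that closed integer intervals (lo, hi) tile start..end.

    Each interval is contiguous by construction; sorting them and requiring
    every interval to begin right after the previous one ends enforces both
    disjointness and covering.
    """
    ordered = sorted(intervals)
    if not ordered:
        return False
    if any(lo > hi for lo, hi in ordered):
        return False
    if ordered[0][0] != start or ordered[-1][1] != end:
        return False
    return all(nxt[0] == cur[1] + 1 for cur, nxt in zip(ordered, ordered[1:]))
```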

Intuitively, a DCC partition completely divides $D$ by "tiling" the
entire domain. Also, note that the total ordering doesn't need to
correspond to the "natural ordering" such as lexicographic for strings
or '<' ordering for numeric values. For instance, we may choose to order
director names by their last name and find a hash function, then also
order them by first name and find another hash function.

Footnote: The least and greatest element may be part of $D$ in some
cases (e.g., all 10-digit phone numbers) and not part of $D$ in others
(e.g., $-\infty$ and $+\infty$ for real numbers).

 1: Input: $\mathcal{C} = \{C_1, C_2, \ldots, C_m\}$, set of matching pairs $T^+$, maximum canopy size $k$
 2: Set $\mathcal{D} \leftarrow \mathcal{C}$ // initialize
 3: repeat
 4:     // Candidate pairs that can be merged
 5:     $\mathcal{D}_{pair} \leftarrow \{(D_1, D_2) \mid |D_1| + |D_2| \le k\}$
 6:     if $\mathcal{D}_{pair} \neq \emptyset$ then
 7:         $(D_1^*, D_2^*) \leftarrow \arg\max_{(D_1, D_2) \in \mathcal{D}_{pair}} \mathrm{benefit}(D_1, D_2) / \min(|D_1|, |D_2|)$
 8:         // Merge $D_1^*$ and $D_2^*$ into one canopy
 9:         $\mathcal{D} \leftarrow \mathcal{D} \cup \{D_1^* \cup D_2^*\} - \{D_1^*, D_2^*\}$
10:     end if
11: until $\mathcal{D}_{pair} = \emptyset$
12: Return $\mathcal{D}$

Algorithm 2: Greedy Canopy Rollup Algorithm

Figure 4: Example of drilling-down on release-year of movies.

Next we formally define the drill down problem.

Definition 6.2 (Drill Down Problem) Consider a single attribute $A$ with
an ordered domain $D_A$, $\prec$, start, end, a set of $n$ duplicate
pairs $T^+ = \{(a_1^1, a_1^2), \ldots, (a_n^1, a_n^2)\}$,
$\forall i: a_i^j \in D_A,\ a_i^1 \preceq a_i^2$, any monotonic black-box
cost function $cost: I \subseteq D_A \rightarrow \mathbb{R}$, and a
maximum cost bound $S$ on any partition. Our goal is to find a DCC
partition $\mathcal{I}$ of $D_A$ such that:

1. $\forall I \in \mathcal{I}: cost(I) \le S$

2. Let $cov(\mathcal{I}, T^+)$ be the number of duplicates covered:
$cov(\mathcal{I}, T^+) = \sum_{1 \le i \le n,\ \exists I \in \mathcal{I} \text{ with } [a_i^1, a_i^2] \subseteq I} 1$.
For any DCC partition $\mathcal{I}'$ satisfying (1) above,
$cov(\mathcal{I}, T^+) \ge cov(\mathcal{I}', T^+)$. □

Example 6.3 Figure 4 gives an example of a hash function that may be
obtained using the drill down problem. This hash function may be added
to the existing space of hash functions in consideration by a blocking
function construction algorithm such as Algorithm 1.

Next we provide an optimal polynomial-time algorithm for the drill down
problem based on dynamic programming. We use two core ideas in the
algorithm described next. First, when finding the first partition in the
given domain, the only "interesting endpoints" of a partition are either
a value at which a duplicate entity lies or the boundary imposed by the
cost bound. Intuitively, we discretize the domain, and now only need to
look at a finite number of endpoints in constructing the optimal
partition; the space of possible DCC partitions still remains
exponential. This observation is formalized below.

Footnote: A cost function is monotonic if
$\forall I, I' \subseteq D_A: I \subseteq I' \Rightarrow cost(I) \le cost(I')$.
Note that, in practice, for uniformly distributed data the cost function
may simply bound the total size of the interval. But for skewed data,
the size of the interval depends on the density of the data; therefore,
we allow any arbitrary cost function.

Lemma 6.4 (Interesting Endpoints) Given a domain $D$, with least and
greatest elements start, end, with duplicate pairs
$T^+ = \{(a_1^1, a_1^2), \ldots, (a_n^1, a_n^2)\}$, cost function $cost$
and a cost bound $S$, consider finding the first partition $[start, X)$
or $[start, X]$ (or open interval on start if $start \notin D$) for the
drill down problem. Let $Y \preceq end$ be the greatest value such that
$cost([start, Y]) \le S$; then there is an optimal drill down solution
with $X \in (\{Y\} \cup \{a_i^j \mid a_i^j \preceq Y\})$.

The second observation is the optimal substructure property exploited by
our dynamic programming algorithm. Given a domain $D$, $\prec$, start,
end over which we want to solve the drill down problem, the optimal
solution for a sub-domain $D_s$, $\prec$, start', end, with
$start' \succ start$, with the same cost function and cost bound, is
identical irrespective of the partitions chosen for $D - D_s$, i.e.,
from start to start'. This property allows us to memoize the solutions
for all sub-domains of known interesting endpoints, namely from $a_i^j$
to end, for every $a_i^j$. We can then find an optimal solution to the
entire domain by recursively considering sub-domains, as formalized
below.

Lemma 6.5 (Optimal Substructure) Given a domain $D$, $\prec$, start, end
with duplicate pairs $T^+ = \{(a_1^1, a_1^2), \ldots, (a_n^1, a_n^2)\}$,
cost function $cost$, cost bound $S$, let $Y$ be the greatest value
satisfying the cost bound (as defined in Lemma 6.4). Let $V(I)$ be the
total number of violations in the optimal solution for the subset of
$T^+$ with each endpoint in $I$. Then, $V(D)$ can be recursively
computed as:

$V([\alpha, end]) = \min_{\beta \in (\{Y\} \cup \{a_i^j \mid \alpha \preceq a_i^j \preceq Y\})} \left( B([\alpha, \beta]) + V((\beta, end]) \right)$

where $B([\alpha, \beta])$ is the number of duplicate pairs broken due
to the interval $[\alpha, \beta]$; i.e.,
$B([\alpha, \beta]) = |\{i \mid \alpha \preceq a_i^1 \preceq \beta \prec a_i^2\}|$.

The above lemma provides a natural dynamic programming algorithm
(described in Algorithm 3), where we recursively solve the drill down
problem for sub-domains, and memoize these solutions for future
recursive calls in $M$; initially, no solution is memoized. Algorithm 3
returns the total number of violated duplicate pairs but also tracks the
specific endpoints. It can be seen easily that this algorithm runs in
near-linear time and space based on the observation that the total
number of different recursive calls is at most $O(n)$: $O(n)$
corresponding to all possible endpoints of duplicate pairs, and another
$O(n)$ corresponding to each maximum $Y$-value from Lemma 6.4 for each
endpoint.

Footnote: We have a similar expression for
$B([\alpha, \beta)) + V([\beta, end])$, which is omitted. We have a
similar formula for every combination of open and closed interval, i.e.,
$[\alpha, end)$, $(\alpha, end)$, $(\alpha, end]$.

 1: Input: $D = [\alpha, b]$, $T^+$, cost function $cost(\cdot)$, cost bound $S$, memoized solutions $M : D_s \rightarrow NULL \cup \mathbb{N}$
 2: if $M(D) \neq NULL$ then
 3:     Return $M(D)$.
 4: end if
 5: if $T^+ = \emptyset$ then
 6:     $M(D) = 0$. Return $M(D)$.
 7: end if
 8: Compute max $Y$-value from $\alpha$ using $cost(\cdot)$, $S$ (Lemma 6.4).
 9: Let $\mathcal{E} = (\{Y\} \cup \{a_i^j \in T^+ \mid \alpha \preceq a_i^j \preceq Y\})$
10: Minimum value $m = \infty$, endpoint $\beta_{opt} = NULL$
11: for $\beta \in \mathcal{E}$ do
12:     Compute $B([\alpha, \beta])$ using $T^+$ (Lemma 6.5)
13:     $cand = B([\alpha, \beta]) + V([\beta, end])$ // $V([\beta, end])$ computed recursively
14:     $M([\beta, end]) = V([\beta, end])$
15:     if $m > cand$ then
16:         $m = cand$; $\beta_{opt} = \beta$
17:     end if
18: end for
19: $M(D) = m$. Return $M(D)$

Algorithm 3: Sketch of the dynamic programming algorithm with
memoization to solve the drill down problem for a given domain $D$ with
a set of duplicate pairs $T^+$.
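
A runnable sketch of this dynamic program is given below for one concrete setting. The setting is an assumption made for illustration: the domain is the integers `start..end`, every interval is closed, and the cost function is a callable `cost(lo, hi)` that is monotonic and satisfies `cost(v, v) <= bound` for every value `v`. Because only closed integer intervals are used, the candidate right endpoints include both each duplicate endpoint `e` and `e - 1`, which plays the role of the half-open variant mentioned in the footnote. The code scans linearly for the Y-value and recomputes the broken-pair count from scratch, so it does not attempt to match the running time discussed above. All names are hypothetical.

```python
from functools import lru_cache


def drill_down(start, end, pairs, cost, bound):
    """Return (min_broken_pairs, intervals) for the integer domain start..end."""
    pairs = [(min(a, b), max(a, b)) for a, b in pairs]
    endpoints = sorted({v for p in pairs for v in p})

    def broken(lo, hi):
        # Pairs whose first value lies in [lo, hi] and whose second does not.
        return sum(1 for a1, a2 in pairs if lo <= a1 <= hi < a2)

    @lru_cache(maxsize=None)          # memoized solutions, keyed by left end
    def solve(lo):
        if lo > end:
            return 0, ()
        # Y: greatest right endpoint that keeps [lo, Y] within the bound.
        y = lo
        while y < end and cost(lo, y + 1) <= bound:
            y += 1
        candidates = {y}
        for e in endpoints:
            if lo <= e <= y:
                candidates.add(e)
            if lo <= e - 1 <= y:
                candidates.add(e - 1)
        best = None
        for hi in sorted(candidates):
            rest_broken, rest = solve(hi + 1)
            total = broken(lo, hi) + rest_broken
            if best is None or total < best[0]:
                best = (total, ((lo, hi),) + rest)
        return best

    total, intervals = solve(start)
    return total, list(intervals)
```

For example, with release years as the domain and interval length as the cost, the returned intervals can be turned into a hash function that maps each year to the index of the interval containing it.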

So far we have considered the drill down problem under the disjointness
condition (recall Definition 6.1). We finally note that the drill down
problem is trivial if we were allowed a non-disjoint set of intervals:
We simply look at each duplicate pair $(a_i^1, a_i^2)$ individually and
create an interval $I_i = [a_i^1, a_i^2]$ if and only if
$cost(I_i) \le S$.
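
For comparison with the disjoint case, the non-disjoint construction amounts to a single filter over the duplicate pairs. The sketch assumes numeric attribute values and a callable `cost(lo, hi)`; both the function name and the interface are illustrative assumptions.

```python
def non_disjoint_intervals(pairs, cost, bound):
    """One interval per duplicate pair, kept only if it respects the bound."""
    intervals = []
    for a, b in pairs:
        lo, hi = min(a, b), max(a, b)
        if cost(lo, hi) <= bound:
            intervals.append((lo, hi))
    return intervals
```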



7 Non-Disjoint Canopies

In this section we consider the construction of a set of canopies that
don't need to be disjoint. The first thing to note is that we need to
revise our cost model from Section 3.2. We note that a cost function
that only penalizes the size of the largest canopy doesn't suffice any
longer: Given a set $U$ of entities, we can create
$\frac{|U|(|U|-1)}{2}$ canopies, with one canopy for each pair of
entities in $U$. Note that this set of canopies has a maximum canopy
size of 2, and a perfect recall of 1. However, constructing a canopy for
each pair is clearly prohibitive, as it incurs a large communication
cost, i.e., each entity needs to be transferred to machines
corresponding to $O(|U|)$ canopies. Therefore, we introduce a cost
metric that minimizes the combination of communication and computation
cost. The cost of a set $\mathcal{C} = \{C_1, \ldots, C_m\}$ is given
by:

$cost(\mathcal{C}) = \ldots \max_{1 \le i \le m} |C_i| \ldots$

The computation cost, as before, is approximated by the computation for
the largest canopy, where a complete pairwise comparison is performed.
The communication is given by the total size of all canopies put
together, which is roughly the number of entities that need to be
transferred to different machines.

We address the problem of finding non-disjoint canopies as finding sets
of canopies $\mathcal{C}_1, \mathcal{C}_2, \ldots$, where each
$\mathcal{C}_i$ is a disjoint set of canopies. In a distributed
environment, each $\mathcal{C}_i$ can be performed in one map-reduce
round. (Alternatively, if non-disjoint canopies are inherently
supported, we may simply construct a single set $\mathcal{C}$ of
canopies as $\mathcal{C} = \bigcup_i \mathcal{C}_i$.) When treating
non-disjoint canopies as multiple rounds of disjoint canopies, once we
bound the computation cost (i.e., the size of the largest canopy) in
each round, our goal reduces to minimizing the number of rounds to
obtain maximum recall with respect to a training dataset.

We present a generic algorithm (Algorithm 4) that extends any algorithm
for disjoint canopy formation to an algorithm for the non-disjoint case.
We assume a bound on the maximum computation in any round, and use the
disjoint algorithm to maximize recall in a round. The duplicate pairs
that are covered are then removed from the labeled dataset, and the next
round is performed. We may truncate the algorithm when all pairs are
covered, or no more pairs can be covered, or a pre-specified maximum
number of rounds has been reached.

8 Experiments
This section presents a detailed experimental study using two large
commercial datasets at Yahoo: (1) a movie dataset consisting of 140K
entities, and (2) a restaurants dataset consisting of 40K entities. We
present a summary of results based on both the datasets, but focus on
movies for a more detailed evaluation. (We focus only on one dataset for
a detailed evaluation due to space constraints; the movies dataset being
larger makes for a more interesting study although trends are similar in
the restaurants dataset.)

Input: Labeled data $T^+$, maximum canopy-size bound $S$, disjoint
algorithm AlgoDisj returning the covered pairs, (optional) bound on the
number of rounds $R$.

numRds = 0, change = true
while ($T^+ \neq \emptyset$) ∧ (change) ∧ (numRds < $R$) do
    numRds = numRds + 1; change = false
    Covered = AlgoDisj($T^+$, $S$)
    if Covered $\neq \emptyset$ then
        $T^+ = T^+ -$ Covered
        change = true
    end if
end while

Algorithm 4: Generic algorithm for performing non-disjoint canopy
formation as multiple rounds of disjoint canopy formation.
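
The round-based scheme of Algorithm 4 can be sketched in Python as follows. The callable `algo_disj` stands in for any disjoint canopy-formation routine that returns the labeled pairs it covers; its interface, like the function name, is an assumption for this illustration.

```python
def non_disjoint_rounds(pairs, max_size, algo_disj, max_rounds=None):
    """Run repeated rounds of a disjoint algorithm over uncovered pairs.

    Returns (rounds, remaining), where rounds[i] is the set of labeled
    pairs newly covered in round i and remaining is what was never covered.
    """
    remaining = set(pairs)
    rounds = []
    while remaining and (max_rounds is None or len(rounds) < max_rounds):
        covered = set(algo_disj(remaining, max_size)) & remaining
        if not covered:            # no progress: further rounds cannot help
            break
        rounds.append(covered)
        remaining -= covered       # next round targets only uncovered pairs
    return rounds, remaining
```

Each entry of `rounds` corresponds to one disjoint blocking function, that is, one map-reduce round.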

The primary goal of our study is to measure the effectiveness (increased
recall) due to the more expressive BlkTree-based blocking, as compared
to restrictions of BlkTrees. We measure recall for disjoint and
non-disjoint versions of all our algorithms. In addition to the primary
objectives described above, our experiments also examine the effects of
increasing the size of canopies on recall, the variation of recall with
the number of disjuncts, the effects of the specific greedy strategies
used, and some basic properties of BlkTree-based blocking. Our
experimental setup is described in Section 8.1 and results are presented
in Section 8.2.

8.1 Experimental Setup

Dataset

We have applied CBLOCK on two commercial datasets from a search engine
company: movies and restaurants. The primary movies dataset used in our
experiments is a large database $D_{movie}$ of 140K movies from Yahoo.
In addition, we use a sample of movies from DBPedia to obtain new
duplicates, in addition to the duplicates already existing in
$D_{movie}$. We constructed a labeled dataset consisting of 1054 pairs
of duplicates: Around 350 pairs of duplicates were obtained using manual
labeling by paid editors. The remaining 704 pairs were obtained
automatically by finding common references to IMDb movies; a small
sample of 100 automatically generated pairs were checked manually to
confirm that these were all duplicates. The schema of movies consisted
of attributes title, director, release year, runtime, genre on which
hash functions were created, and also other attributes (such as genre
and crew members) that weren't used for blocking. A sample of the space
of hash functions used in our experiments is shown in Table 1.

The restaurants dataset used in our experiments consists of 40,000
restaurant records with attributes name, street, city, state, and zip.
After de-duplication, there are 13,000 unique restaurant records. We use
a labeled dataset of 4,674 duplicate pairs, and we used a similar set of
hash functions as in Table 1.

Metrics

We evaluate our canopy generation algorithms using two metrics: recall
and computation cost. Recall is measured as the fraction of matching
pairs in $T^+$ that appear within some canopy (Definition 4.1). Our
algorithms are used to learn blocking hash functions, which are in turn
applied to new data. We measure the computation cost in terms of the
time taken to apply the hash function learnt by our algorithms on the
dataset. Note that this is not the time taken to learn blocking
functions. For non-disjoint canopy formation (Algorithm 4 in Section 7)
we measure the increase in recall as the number of disjuncts (or
map-reduce steps) is increased.

Algorithms

We describe our algorithms next. If any of our algorithms result in a
canopy $C$ with size larger than our maximum size limit $S$, we further
split it randomly into $\lceil |C| / S \rceil$ smaller parts. The
algorithms we compare are

• Random (R): Each entity in $U$ is assigned uniformly at random to one
of $\lceil |U| / S \rceil$ canopies.

• Single-Hash (SH): Canopies are formed by picking a single hash
function which maximizes recall.

• Chain (C): Canopies are formed by picking the best conjunction of hash
functions. (Note that this is the "size-aware" analogue of the
approaches taken by previous work [7, 23] on using labeled data.)

• Chain-Tree (CT): A restriction to our tree based hash function where
the same hash is used at each level. SH, C, CT were described earlier in
Section 4.2.1.

• Hierarchical Blocking Tree (HBT): Our BlkTree-based canopy generation
algorithm presented in Algorithm 1.

We also consider non-disjoint variants of all our algorithms; if
$A \in \{SH, C, CT, HBT\}$ denotes one of the above algorithms, we use
A-ND to denote its non-disjoint variant (i.e., using A in Algorithm 4).

Attributes       Hash function
All              (1) h(x) = x; (2) h(x) = prefix/suffix of length K; (3) h(x) = most frequent K characters in alphanumeric order; (K = 1, 3, 5)
title            h(x) = longest token of x
year, runtime    number rounded to create k-point intervals, i.e., h(x) = x - (x mod k)
director         (1) h(x) = first-name of x; (2) h(x) = last-name of x

Table 1: Sample of the space of hash functions on movies used in our experiments.
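
The atomic hash functions sampled in Table 1 are simple enough to write down directly. The Python sketch below is an illustration only: the function names are hypothetical, and details that the table does not specify (lower-casing, ignoring non-alphanumeric characters, tie-breaking among equally frequent characters, whitespace tokenization) are assumptions.

```python
from collections import Counter


def prefix(x, k):
    """First k characters of a string attribute."""
    return x[:k]


def suffix(x, k):
    """Last k characters of a string attribute."""
    return x[-k:]


def most_frequent_chars(x, k):
    """The k most frequent alphanumeric characters, in alphanumeric order.

    Assumption: case is ignored and ties are broken alphabetically.
    """
    counts = Counter(ch for ch in x.lower() if ch.isalnum())
    top = sorted(counts, key=lambda ch: (-counts[ch], ch))[:k]
    return "".join(sorted(top))


def longest_token(x):
    """Longest whitespace-separated token (first one wins on ties)."""
    return max(x.split(), key=len)


def bucket(x, k):
    """Map a number to the start of its k-point interval: x - (x mod k)."""
    return x - (x % k)


def first_name(x):
    return x.split()[0]


def last_name(x):
    return x.split()[-1]
```

A blocking function such as a chain or a BlkTree would then be composed from functions like these, applied to the relevant attribute of each record.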

Setup

We perform 5-fold cross-validation for all runs of algorithms: We split
$T^+$ into 5 equal pieces randomly, then average over five runs with
each run using 4 pieces of $T^+$ as a training set to obtain the
blocking function, then use the 5th piece as a test set. Since we don't
make any novel contribution on size estimation, our oracle size()
computes the exact sizes of canopies based on the entire dataset. Our
experiments were performed by varying the allowed maximum canopy size
with 1K, 5K, 10K, 20K, 100K entities per canopy.
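
The recall metric and the cross-validation protocol can be sketched as follows. The sketch covers the disjoint case, where a learned blocking function maps each record to exactly one canopy; the `learn` callable, which takes training pairs and returns such a mapping, is a hypothetical stand-in for any of the algorithms above.

```python
import random


def recall(pairs, block_of):
    """Fraction of labeled duplicate pairs whose records share a canopy."""
    pairs = list(pairs)
    hits = sum(1 for r1, r2 in pairs if block_of(r1) == block_of(r2))
    return hits / len(pairs)


def five_fold_recall(pairs, learn, seed=0, folds=5):
    """Average test recall over `folds` train/test splits of the pairs."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    pieces = [pairs[i::folds] for i in range(folds)]
    scores = []
    for i in range(folds):
        train = [p for j, piece in enumerate(pieces) if j != i for p in piece]
        block_of = learn(train)              # blocking function from 4 pieces
        scores.append(recall(pieces[i], block_of))   # evaluate on the 5th
    return sum(scores) / folds
```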

8.2 Results

We start by presenting detailed results on the movie dataset (Sections
8.2.1 to 8.2.3). Finally, we present a brief summary of results on the
restaurants dataset in Section 8.2.4.



8.2.1 Disjoint Canopies

Our first experiment was to compare the overall recall obtained by each
of the algorithms: R, SH, C, CT, and HBT. Figure 5(a) shows the recall
obtained by each of the algorithms on the movie dataset, varying the
maximum allowed canopy size. (For each of the algorithms, we picked the
best of the optimistic, pessimistic, and expected greedy picking
strategies.) The most important observation is that HBT achieves a
significantly higher recall than C and SH, particularly when the maximum
canopy size is lower. The reason for HBT's higher recall is the greater
expressive power of BlkTrees as a construct for describing disjoint
blocking functions; BlkTrees are able to apply a hash function at the
first level that creates many good small canopies and a few large
canopies, which are further split at subsequent levels of the tree.

Another interesting observation from Figure 5(a) is that CT performs
roughly as well as HBT, despite the slightly lower expressive power:
Intuitively, the added power of HBT is effective when different nodes in
the same level need different hash functions. Such a case would arise
when different sections of the data have differing properties (e.g., US
movies versus German movies); our dataset, however, only contained US
movies. Finally, as expected, R gives the lowest recall among all
algorithms; henceforth, we omit R from the rest of our experiments.

To further understand the effects of the three greedy picking strategies
(optimistic, pessimistic, and expected) described in Section 4.4, in
Figure 5(b) we plot the recall for each of the algorithms by varying the
greedy picking strategy. We note that in most cases all three strategies
perform very similarly, with the optimistic picking strategy slightly
outperforming the others. The intuition for the optimistic greedy
strategy performing slightly better is that an optimistic estimate is
better than an expected estimate since future levels of blocking are
significantly better than a random split of each large canopy. Since
optimistic is never worse than the other strategies, for the rest of our
experiments we choose the optimistic strategy for each algorithm.



8.2.2 Non-disjoint Canopies

Next we consider the non-disjoint variants of each of the algorithms.
Figure 5(c) shows the overall recall for each non-disjoint algorithm as
the number of canopies is varied. We notice, once again, that HBT-ND
achieves a significantly higher recall than C-ND and SH-ND. In
particular, C-ND is the size-aware analogue of the previous state of the
art [7, 23]. The reason for a much higher recall in HBT-ND is again the
larger space of blocking functions BlkTrees can represent. Specifically,
any conjunction that contains even one canopy larger than the maximum
allowed is not permitted (or the conjunction needs to be further
restricted, losing more duplicate pairs). Note however that the overall
recall of CT-ND is very similar to that of HBT-ND; we shall see shortly
that in the initial rounds of disjunction, HBT-ND increases recall
slightly more rapidly than CT-ND.

[Figure 5 panels; plots not reproduced. Horizontal axis in (a), (c), (d): Max. Canopy Size.]

(a) Comparison of disjoint canopy formation algorithms, varying the maximum size of canopies.
(b) Comparison of optimistic, pessimistic, and expected greedy picking strategies for each algorithm, fixing the maximum canopy size to 10000.
(c) Comparison of overall recall for non-disjoint canopy formation algorithms (SH-ND, C-ND, CT-ND, HBT-ND), varying the maximum size of canopies.
(d) Benefit of performing non-disjoint canopy formation, comparing HBT-ND and HBT, varying maximum canopy size.
(e) Variation of recall as the number of rounds is increased, for all non-disjoint canopy formation algorithms with maximum canopy size 5000.
(f) Summary of recall on applying CBLOCK on the restaurants dataset, varying the maximum allowed canopy size. Comparison of BlkTree-based and conjunctive blocking.

Approach        Algorithm    100     200     500     1000
Disjoint        C            0.33    0.49    0.66    0.75
Disjoint        HBT          0.84    0.87    0.89    0.91
Non-disjoint    C-ND         0.38    0.51    0.66    0.75
Non-disjoint    HBT-ND       0.97    0.98    0.99    0.99

Figure 5: Experimental Results

A second observation on non-disjoint canopy formation is that the
non-disjoint versions of each algorithm obtain higher recall than the
corresponding disjoint versions. In Figure 5(d) we show the increase in
recall obtained by HBT-ND as compared to HBT, for each maximum canopy
size. Note that the additional benefit of non-disjointness diminishes as
the maximum canopy size is increased.

Next let us take a closer look at how the recall changes as the number
of iterations is increased. To examine the difference between CT-ND and
HBT-ND (as well as other non-disjoint algorithms), we plot the recall
obtained after each round of disjoint canopy formation. Figure 5(e)
plots the overall recall for the case of maximum canopy size 5000; we
picked one fold of our cross-validation in which CT-ND ends with a
slightly higher recall than HBT (therefore the apparent discrepancy with
Figure 5(a)). First, note that for every iteration, HBT-ND is better
than C-ND and SH-ND, which means that the number of positive examples
covered increases more steeply for HBT-ND. Second, we see that HBT-ND
obtains a higher recall than CT-ND initially, but CT-ND eventually ends
with a slightly higher recall; in other words, with a limited number of
map-reduce rounds, HBT-ND performs better than CT-ND. An optimal
strategy of choosing a non-disjoint canopy formation by combining HBT-ND
and CT-ND is left as future work.

Function                 SH      CT      HBT
Time per record (μs)     8.8     17.7    7.1
Average tree height      1       2.8     1.34

Table 2: (1) Average running time (in μs) of applying the best blocking
function for each record. (2) Average length of the tree. All the
numbers are for a max canopy size of 10,000.

8.2.3 Computation cost and Tree size

Computational Cost: We compared the computation cost (i.e., running
time) of applying a BlkTree against the cost of applying other blocking
functions. The primary objective of investigating the running time is to
establish that BlkTrees do not add a significant burden to the time
required to apply the blocking function on an entire dataset. Table 2
shows the running time of applying the best blocking function (for
maximum canopy size 10,000) for each of the algorithms (conjunctions,
being similar to applying a single hash function, are omitted); these
numbers are averaged over the ~140K movie entities and over 5 repeated
applications of the blocking function on the entire dataset. We note
that applying each of the blocking functions requires a negligible
amount of time (always under 20 μs per record), and BlkTrees don't add
any discernible computational cost.

Tree Size: Table 2 also shows the height of the tree for CT and HBT
(averaged over the 5 folds of cross-validation). It is noteworthy that
HBT obtains similar recall with a shorter BlkTree than CT. This is
because the BlkTree constructed using HBT is able to selectively create
longer branches only when necessary. The longer tree for CT explains the
higher blocking time per record.

8.2.4 Summary of Results for Restaurants

We present a very brief summary of our results on the restaurants
dataset; restaurants displayed a similar general trend as movies, and a
detailed study of restaurants is omitted due to space constraints. Table
5(f) presents the overall recall for HBT and HBT-ND compared against C
and C-ND, varying the sizes of the maximum canopy: (1) We note that both
the disjoint and non-disjoint versions of HBT significantly outperform
the disjoint and non-disjoint versions of conjunctive blocking. (2)
Further, as with movies, the recall achieved by HBT is very high on
restaurants, and very close to 1 with non-disjoint blocking even for
small canopy sizes.